The problem of computing dependability measures of repairable systems with general failure, repair
and maintenance processes is a hard problem to solve in general either by analytical or by numerical
methods. Monte Carlo simulation could be used to solve this problem; however, standard simulation
takes a very long time to estimate system reliability and availability with reasonable accuracy
because typically the system failure is a rare event. When the failure and repair time distributions
are exponential, importance sampling has been used successfully in the past to reduce simulation run
lengths. In this paper, we extend the applicability of importance sampling to non-Markovian
models with general failure and repair time distributions. We show that by carefully selecting a
heuristic for importance sampling, orders of magnitude reduction in simulation run-lengths can be
obtained. We illustrate the effectiveness of the technique by modelling a large repairable computing
system. Also, we study the effect of periodic maintenance on systems with components having

1. Introduction


The problem of computing dependability measures of repairable systems with general failure,
repair and maintenance processes is a hard problem either by analytical or by numerical methods.
Such systems, in general, cannot be modelled by Markov or even semi-Markov processes. HARP
[ .1 ] solves large models with general failure time distributions by creating a non-homogeneous
Markov chain model of the system and then solving the corresponding differential equations
numerically. The technique has been applied to non-repairable systems only (transient recoveries are
allowed, but they are approximated by instantaneous transitions). Furthermore, only transient
measures (e.g., reliability) are estimated. CARE-III [ IS ] uses numerical integration methods to
solve similar models.

The goal of this paper is to model systems with general failure, repair and maintenance processes,
and solve them for both transient (e.g., reliability and mean time to failure) and stationary (e.g.,
steady-state availability) measures. In relatively simple cases, one could obtain the Laplace
transform of dependability measures for such models and numerically invert them to obtain the desired
results [14]. However, these methods are limited to small models and are prone to unboundable
numerical errors.

An alternative approach is to use Monte Carlo simulation. The advantage of this method is that
arbitrary system details can be modeled, and furthermore, all the system states need not be
generated. The disadvantage of this approach is that standard simulation takes a very long time to
estimate dependability measures with reasonable accuracy because system failure events are very rare
in highly dependable systems [4]. When the failure and repair time distributions are exponential, the
importance sampling technique has been used successfully in the past to reduce simulation
run-lengths significantly [2, 10, 12]. Basically, the system failure events are forced to occur more
often by increasing the failure rates; unbiased estimates of dependability measures are obtained by
multiplying the value of the measure on a sample path by the likelihood ratio of the sample path.
The likelihood ratio for a given sample path is the ratio of the probability of the sample path under
the original distributions (e.g., with the original failure and repair rates) over the probability of the
same sample path under the new distributions (e.g., with the new failure and repair rates).
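
As a minimal sketch of this idea in the exponential case, the following Python fragment estimates the probability that a component fails before a concurrent repair completes. The rates, the biased failure rate `lam_new` and the function name are illustrative assumptions, not values taken from this paper. Only the failure clock is biased, so the likelihood ratio of a sample path reduces to the ratio of the original and the new failure densities at the sampled failure time.

```python
import math
import random


def failure_before_repair(n, lam=1e-3, mu=1.0, lam_new=0.5, seed=0):
    """Importance-sampling estimate of P(failure time < repair time).

    Failure ~ Exp(lam) in the original system, but it is sampled from
    Exp(lam_new); repair ~ Exp(mu) is left unchanged.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        t_fail = rng.expovariate(lam_new)   # biased failure time
        t_rep = rng.expovariate(mu)         # unbiased repair time
        if t_fail < t_rep:                  # the rare event occurred
            # Likelihood ratio: original density over sampling density.
            ratio = (lam * math.exp(-lam * t_fail)) / (
                lam_new * math.exp(-lam_new * t_fail))
            total += ratio
    return total / n


estimate = failure_before_repair(20000)
exact = 1e-3 / (1e-3 + 1.0)   # closed form lam / (lam + mu) for this toy case
```

The closed form is available only because both clocks are exponential in this toy case; it serves as a check on the weighted estimate.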

In this paper, we extend the applicability of importance sampling to non-Markovian systems with
general failure, repair and maintenance processes. For general discrete-event systems, importance
sampling has been discussed in [5, 6]. Basically, a Generalized Semi-Markov Process (GSMP)
formalism is used to represent such systems, and the likelihood ratio of a sample path is written in
terms of the various probability distributions (e.g., failure, repair and maintenance distributions) in
the original and the new (simulated) systems. However, [5, 6] did not consider the design
and implementation of specific importance sampling distributions that are required in order to
obtain effective variance reduction in non-Markovian models of highly dependable systems. One
possible way to appropriately implement importance sampling, which we propose and use in this
paper, is accomplished by cancelling and rescheduling previously scheduled events. For example,
when one component fails in a system with a redundant component pair, we speed up the failure
of the other component so that it fails with high probability before the repair of the first
component. This involves cancelling the originally scheduled failure event for the second component and
rescheduling it using a new failure distribution with a smaller mean time to failure.

In Sections 2 and 3 we give a concise description of discrete-event systems, which is appropriate for
our purpose in this paper; namely, to formally represent the probability of a sample path. This
yields a representation for the likelihood ratio which is the key to importance sampling.

In Section 4, we give the basic estimators for some commonly used measures in highly dependable
systems, such as reliability, steady-state availability and mean time to failure. A simple example
of a two-component system is used to explain these measures as well as the importance sampling
technique used to estimate them. In Section 5, we discuss the implementation of these methods in
a software tool which we used to generate and simulate large models. This tool is based on the
CSIM package [15, 16]. In Section 6, we use three examples to illustrate the effectiveness of the
proposed importance sampling techniques. First, we use a small example to experiment with some
heuristics for selecting the new probability distributions which make the typically rare system
failures occur more often. Second, these heuristics are applied in a large example to show that orders
of magnitude reduction in variance can be obtained. We use exponential failure and repair
distributions in this example to ascertain the correctness of the results obtained by comparing them
against numerical results obtained from the SAVE package [7]. In the third example, we use a
Weibull failure distribution and periodic maintenance for all individual components in the system.
We study the effect of the hazard rate (i.e., increasing, decreasing and constant failure rates) on the
optimal maintenance period. Such studies cannot be performed with existing analytical or
numerical methods. In Section 7 we give conclusions and some directions for future research.

2.  Discrete-Event  Systems 

In this section we give some notation and basic properties of discrete-event systems, which will
assist in representing the probability of a sample path and the likelihood ratio required for
importance sampling in simulations of such systems. A precise mathematical framework for the study
of discrete-event systems is given by Glynn in [5]; he gives a generalized semi-Markov process
(GSMP) formalism of discrete-event systems. Here, we give an alternative concise description of
discrete-event systems, which is appropriate and sufficient for our purpose. In our description we
have left out some of the details and generalities which are not needed for the developments in this
paper.

A discrete-event system is characterized by a set of events $\mathbf{E}$ which can trigger transitions of its
state and a set $\mathbf{Z}$ of integer-valued output state vectors ($\mathbf{Z}$ is possibly a countably infinite set).
With each event $e \in \mathbf{E}$ we associate a clock. The reading $c(e)$ is the "remaining lifetime" of clock
$e$, i.e., the time remaining for clock $e$ to expire; $c(e) = \infty$ if clock $e$ is inactive. The choice of the
output (observable or measured) state vector in a discrete-event system depends on the application at
hand and the desired level of detail.

The internal state of a discrete-event system at a given time is completely determined by its output
state and the set of active clocks (i.e., the set of events which can trigger a transition to another
internal state) with the associated clock readings. Upon the $i$-th transition, let $Z_i \in \mathbf{Z}$ be the
output state vector and $E_i \subseteq \mathbf{E}$ be the set of active clocks; $c_i(E_i)$ is a vector with the associated
clock readings. Then $X_i = (Z_i, c_i(E_i))$ is the internal state of the discrete-event system upon the
$i$-th transition. Notice that the output state and the set of active clocks characterizing the internal
state change only in response to transitions (events), while the clock readings are continuously
changing at the same rate (in general, different clock rates may be assumed, see e.g., [5]). It is
typical in discrete-event systems that the output state does not change between transitions; for
example, the number of customers in a queueing system. Therefore, the output state trajectory of a
discrete-event system is completely described by the output state at transition times of the internal
state. Let $t_i$ ($i \ge 0$) be the time of the $i$-th transition, with $t_0 = 0$. Then $T_i = t_{i+1} - t_i$ is the
time between the $i$-th and the $(i+1)$-th transitions. Let $Z(t)$ denote the output state at time $t$;
then $(Z(t), t \ge 0)$ is the output state trajectory, and $Z(t) = Z_k$ if $t_k \le t < t_{k+1}$.

As indicated above, we only consider the internal state sequence at transition times, since this is
sufficient to determine the output state trajectory of the discrete-event system. A sample path of
the discrete-event system up to the $n$-th transition is denoted by the sequence $X_{0,n}$ of internal
states at transition times,

$$X_{0,n} = (X_0, X_1, \ldots, X_n).$$

Let $e_i^* = \arg\min_{e \in E_i} \{c(e)\}$, $i \ge 0$. Then $e_i^*$ is the clock which triggers the $(i+1)$-th transition and
$T_i = c(e_i^*)$. The internal state $X_{i+1} = (Z_{i+1}, c_{i+1}(E_{i+1}))$ upon the $(i+1)$-th transition is
determined by the sequence $X_{0,i}$ and may depend on the complete history of the system. The set of
active clocks $E_{i+1}$ upon the $(i+1)$-th transition is determined by

$$E_{i+1} = E_i - \{e_i^*\} - A_i + N_{i+1},$$

where $A_i$ is the set of clocks canceled (aborted) upon the $(i+1)$-th transition and $N_{i+1}$ is the set
of new clocks activated upon the $(i+1)$-th transition. The sets of clocks $A_i$ and $N_{i+1}$ and the
output state $Z_{i+1}$ are determined probabilistically, depending on the trigger event $e_i^*$ and the
sequence $X_{0,i}$. Therefore, the $(i+1)$-th transition triggered by $e_i^*$ yields the output state $Z_{i+1}$ and
the active set of clocks $E_{i+1}$ with a probability denoted by $p_{i+1}(Z_{i+1}, E_{i+1}; e_i^*)$. The subscript
$i+1$ of $p$ symbolizes the dependence on the sequence $X_{0,i}$ (routing in queueing networks is an obvious
example for the use of these transition probabilities).
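
The clock mechanics described in this section can be sketched in a few lines of Python. This is an illustration only: the function name and the event labels are hypothetical, and the cancelled and newly activated clocks are passed in by the caller, whereas in the formalism above they are chosen probabilistically given the trigger event and the history.

```python
def next_transition(clocks, cancelled=(), new_clocks=None):
    """Advance a set of active clocks by one transition.

    clocks:      dict mapping event name -> remaining lifetime (active set E_i)
    cancelled:   events aborted at this transition (the set A_i)
    new_clocks:  dict of newly activated events and readings (the set N_{i+1})
    Returns (trigger event e_i*, holding time T_i, clocks for E_{i+1}).
    """
    trigger = min(clocks, key=clocks.get)        # e_i* = argmin c(e)
    holding = clocks[trigger]                    # T_i = c(e_i*)
    updated = {
        e: c - holding                           # old clocks keep running
        for e, c in clocks.items()
        if e != trigger and e not in cancelled
    }
    updated.update(new_clocks or {})
    return trigger, holding, updated


clocks = {"fail_1": 5.0, "fail_2": 8.0}
trigger, holding, clocks = next_transition(clocks, new_clocks={"repair_1": 1.5})
```

Here the first component's failure clock expires first, the second failure clock is carried forward with its reading reduced by the holding time, and a repair clock is activated.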

We denote by $f_i(t; e)$ (resp. $\bar{F}_i(t; e)$) the probability density function (resp. the complementary
distribution function) of the (conditional) "remaining lifetime" of clock $e \in E_i$ at the $i$-th transition.
The subscript $i$ symbolizes the dependence of this probability density function on the history of the
system through its internal state sequence $X_{0,i}$. For example, if clock $e$ was originally scheduled
using a probability density function $f(\cdot; e)$ and if the age of the clock at the $i$-th transition is $a$,
then the density of the remaining lifetime $t$ is $f_i(t; e) = f(t + a; e)/\bar{F}(a; e)$. Similarly,
$\bar{F}_i(t; e) = \bar{F}(t + a; e)/\bar{F}(a; e)$. If clock $e$ is newly scheduled at the $i$-th transition, then the age is
0, so that $f_i(t; e) = f(t; e)$ and $\bar{F}_i(t; e) = \bar{F}(t; e)$. Let $O_{i+1}$ be the set of old clocks which continue
to be active upon the $(i+1)$-th transition, i.e., $O_{i+1} = E_{i+1} - N_{i+1}$, $i \ge 0$. The clock reading
$c(e)$, $e \in O_{i+1}$, is updated as follows: $c(e) \leftarrow c(e) - T_i$. Upon the $(i+1)$-th transition, the
probability density function and the complementary distribution function of the remaining time on clock
$e \in O_{i+1}$ are changed to reflect the elapsed time on this clock, i.e., for all $e \in O_{i+1}$,

$$f_{i+1}(t; e) = \frac{f_i(t + T_i; e)}{\bar{F}_i(T_i; e)}, \tag{2.1}$$

$$\bar{F}_{i+1}(t; e) = \frac{\bar{F}_i(t + T_i; e)}{\bar{F}_i(T_i; e)}. \tag{2.2}$$

Notice that these modified distributions are not needed to determine the clock readings
$c(e)$, $e \in O_{i+1}$, since, as stated above, we can use the remaining lifetime as the updated clock reading
for an old clock. However, they are used to describe the probability of a sample path and the
likelihood ratio, as we shall see in the following.
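
To make the conditioning on clock age concrete, here is a small Python sketch that uses a Weibull lifetime. The Weibull choice, the parameter values and the function names are assumptions made for illustration. It checks numerically that updating the remaining-lifetime density in two steps (first for age a, then for a further holding time T) agrees with conditioning once on the total age a + T.

```python
import math


def weibull_pdf(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-(t / scale) ** shape)


def weibull_sf(t, shape, scale):
    """Complementary distribution function (survival function)."""
    return math.exp(-(t / scale) ** shape)


def residual_pdf(t, age, shape, scale):
    """Density of the remaining lifetime t for a clock of the given age."""
    return weibull_pdf(t + age, shape, scale) / weibull_sf(age, shape, scale)


def residual_sf(t, age, shape, scale):
    """Probability that a clock of the given age survives a further time t."""
    return weibull_sf(t + age, shape, scale) / weibull_sf(age, shape, scale)


# Two-step update (age a, then a further holding time T) versus a single
# conditioning on the total age a + T.
shape, scale, a, T, t = 2.0, 100.0, 30.0, 12.0, 5.0
two_step = residual_pdf(t + T, a, shape, scale) / residual_sf(T, a, shape, scale)
one_step = residual_pdf(t, a + T, shape, scale)
```

The two values coincide because the survival probabilities at age a cancel in the quotient, which is why only the current age of a clock needs to be stored.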

Given that the internal state of the discrete-event system is $X_i$ at the $i$-th transition, we can write
the probability density (likelihood) that the next internal state is $X_{i+1}$ at the $(i+1)$-th transition.
We denote this probability by $P(X_{i,i+1})$; then

$$P(X_{i,i+1}) = f_i(T_i; e_i^*)\, p_{i+1}(Z_{i+1}, E_{i+1}; e_i^*) \prod_{e \in E_i,\, e \ne e_i^*} \bar{F}_i(T_i; e). \tag{2.3}$$

It follows that the likelihood of a sample path $X_{0,n}$, up to the $n$-th transition, is given by

$$P(X_{0,n}) = \prod_{i=0}^{n-1} P(X_{i,i+1}). \tag{2.4}$$

3. Importance Sampling in Simulations of Discrete-Event Systems

In this section, we discuss importance sampling which can be used to obtain a significant variance
reduction over standard simulation when estimating dependability measures. The basic idea of
importance sampling is to simulate the system under different probability distributions, so as to
appropriately and quickly move the system towards failure. Since the simulated system is
dynamically different from the original system, a correction factor is needed to compensate for the
resulting bias. This correction factor is called the likelihood ratio and must be used with importance
sampling to obtain unbiased estimates.

Consider a simulation of a discrete-event system for the purpose of estimating the expected value
of a particular performance measure, say $M$. Let $M(X_{0,N})$ be the value of the measure on a sample
path $X_{0,N}$, where $N$ is a stopping time relative to the internal state sequence, i.e., $I(N = i)$ is a
function of $X_{0,i}$ ($I(\cdot)$ is the indicator function which equals one if its argument is true; otherwise
it equals zero). In the original system, the likelihood of the sample path $X_{0,N}$ is $P(X_{0,N})$ as given
by Equation (2.4). To implement importance sampling, we simulate the system with a different
likelihood $P'(\cdot)$ for its sample path. For $P'(\cdot)$, it is necessary that the following hold for all
$X_{0,N}$:

$$P'(X_{0,N}) > 0 \ \text{ whenever } \ M(X_{0,N})\, P(X_{0,N}) > 0. \tag{3.1}$$

Under $P(\cdot)$, the expectation $E_P(M)$ can be expressed as follows:

$$E_P(M) = \sum_{\forall X_{0,N}} M(X_{0,N})\, P(X_{0,N})$$

$$\qquad = \sum_{\forall X_{0,N}} M(X_{0,N})\, L(X_{0,N})\, P'(X_{0,N}) \tag{3.2}$$

$$\qquad = E_{P'}(M L),$$

where $E_{P'}(M L)$ is the expectation of $M L$ under $P'(\cdot)$. The likelihood ratio
$L(X_{0,N}) = P(X_{0,N})/P'(X_{0,N})$ is the ratio of the sample path likelihoods under the original and the
new distributions, $P$ and $P'$, respectively. The use of the summation sign in the above equation is
not quite precise, since the sample space is uncountably infinite; however, the summation can be
interpreted as an integral with respect to an appropriate probability measure to make this fully
rigorous.

Let $f'_i(t; e)$, $\bar{F}'_i(t; e)$, and $p'_{i+1}(Z_{i+1}, E_{i+1}; e_i^*)$ be the probability distributions, in the simulated
system, corresponding to $f_i(t; e)$, $\bar{F}_i(t; e)$, $e \in E_i$, and $p_{i+1}(Z_{i+1}, E_{i+1}; e_i^*)$ in the original system.
These distributions should be chosen appropriately, so as to favorably bias the dynamics of the system
while making sure that the condition in Equation (3.1) is satisfied. The rules for updating the new
quantities, $f'_i(\cdot; e)$ and $\bar{F}'_i(\cdot; e)$, at transition times are analogous to those for updating $f_i(\cdot; e)$ and
$\bar{F}_i(\cdot; e)$ given in Equations (2.1) and (2.2). The likelihood ratio associated with the sample path $X_{0,N}$
is given by

$$L(X_{0,N}) = \prod_{i=0}^{N-1} \frac{f_i(T_i; e_i^*)}{f'_i(T_i; e_i^*)} \cdot \frac{p_{i+1}(Z_{i+1}, E_{i+1}; e_i^*)}{p'_{i+1}(Z_{i+1}, E_{i+1}; e_i^*)} \prod_{e \in E_i,\, e \ne e_i^*} \frac{\bar{F}_i(T_i; e)}{\bar{F}'_i(T_i; e)}. \tag{3.3}$$



The above equation is the basis for importance sampling in discrete-event systems. Rather than
replicating the r.v. $M(X_{0,N})$ under $P$ to estimate $E_P(M)$, we replicate the r.v. $M(X_{0,N})\, L(X_{0,N})$
under $P'$ to estimate $E_{P'}(M L)$, which is equal to $E_P(M)$. When $P'$ is chosen appropriately,
significant reduction in the variance of the r.v. $M L$, under $P'$, can be achieved (compared to the variance
of the r.v. $M$ under $P$). This choice depends on the model at hand and on the measure to be
estimated.

Notice that Equation (3.3) allows us to update the likelihood ratio at transition times in a simple
multiplicative manner. Notice also that at any transition we can actually change the values of any
active (old and new) clock according to some chosen, essentially arbitrary, new distribution. This
is equivalent to cancelling an active clock and rescheduling (i.e., resampling) its remaining lifetime
from the new distribution. We illustrate this by the following: Suppose that clock $e$ is activated at
the $i$-th transition and that we assign a value to this clock according to the probability density
function $f(\cdot; e)$. At the $(i+1)$-th transition, we decide to reschedule clock $e$, thus we assign to
its remaining lifetime a new value $t'$ according to a new probability density function $f'(\cdot; e)$.
Further, we suppose that clock $e$ continues to run at the $(i+2)$-th transition and it expires at the
$(i+3)$-th transition. In effect, clock $e$ has a total lifetime $T_i + t'$, where $t' = T_{i+1} + T_{i+2}$.
According to Equation (3.3), the contribution of clock $e$ to the likelihood ratio at the $(i+1)$-th
transition is

$$\frac{\bar{F}_i(T_i; e)}{\bar{F}'_i(T_i; e)} = \frac{\bar{F}(T_i; e)}{\bar{F}(T_i; e)} = 1,$$

since the clock still runs under its original distribution up to that transition. Using Equations (2.1)
and (2.2), the contribution at the $(i+2)$-th transition is

$$\frac{\bar{F}_{i+1}(T_{i+1}; e)}{\bar{F}'_{i+1}(T_{i+1}; e)} = \frac{\bar{F}(T_i + T_{i+1}; e)/\bar{F}(T_i; e)}{\bar{F}'(T_{i+1}; e)}.$$

Using Equations (2.1) and (2.2), the contribution at the $(i+3)$-th transition is

$$\frac{f_{i+2}(T_{i+2}; e)}{f'_{i+2}(T_{i+2}; e)} = \frac{f(T_i + t'; e)/\bar{F}(T_i + T_{i+1}; e)}{f'(t'; e)/\bar{F}'(T_{i+1}; e)}.$$

It follows that the overall contribution of clock $e$ to the likelihood ratio between the $i$-th and the
$(i+3)$-th transitions is

$$\frac{f(T_i + t'; e)}{\bar{F}(T_i; e)\, f'(t'; e)}.$$

In the above equation, notice that the numerator is simply the likelihood of the total lifetime
$(T_i + t')$ of clock $e$ under the original probability density function $f(\cdot; e)$, while the denominator
is the likelihood with rescheduling.
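
The cancellation in this derivation can be checked numerically. The sketch below assumes, purely for illustration, a Weibull original distribution and an exponential rescheduling distribution with arbitrary holding times. It multiplies the three per-transition contributions of the rescheduled clock and compares the product with the single ratio of the original lifetime density to the likelihood with rescheduling, f(T_i + t') / (F_bar(T_i) f'(t')).

```python
import math


# Original clock: Weibull with assumed shape 2 and scale 100.
def f(x):
    return (2 / 100.0) * (x / 100.0) * math.exp(-(x / 100.0) ** 2)


def F_bar(x):
    return math.exp(-(x / 100.0) ** 2)


# Rescheduling distribution: exponential with assumed rate 0.5.
def f_new(x):
    return 0.5 * math.exp(-0.5 * x)


def F_bar_new(x):
    return math.exp(-0.5 * x)


T_i, T_i1, T_i2 = 40.0, 0.7, 1.1      # holding times (illustrative values)
t_new = T_i1 + T_i2                   # rescheduled remaining lifetime t'

# Contribution at transition i+1: the clock is still on its original schedule.
step1 = F_bar(T_i) / F_bar(T_i)
# Contribution at transition i+2: survival over T_{i+1}, original versus new.
step2 = (F_bar(T_i + T_i1) / F_bar(T_i)) / F_bar_new(T_i1)
# Contribution at transition i+3: the clock expires.
step3 = (f(T_i + t_new) / F_bar(T_i + T_i1)) / (f_new(t_new) / F_bar_new(T_i1))

product = step1 * step2 * step3
closed_form = f(T_i + t_new) / (F_bar(T_i) * f_new(t_new))
```

The intermediate survival terms cancel in pairs, so a simulator may accumulate the ratio one transition at a time without tracking the whole history of a clock.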

As an example to show how importance sampling can be implemented, let us consider a
machine-repairman model with two components of the same type and one single-server FCFS repair facility.
Each component has general failure and repair distributions. The system is initially operational,
with all components as good as new, and it continues to be operational as long as at least one
component is operational. In a highly dependable system, a component's mean time to failure is
usually several orders of magnitude larger than its mean time to repair. Therefore, a system failure
is a rare event. Consider the estimation of a dependability measure, such as the unreliability, using
the replication method of simulation. Clearly, if we use standard simulation, a very large number
of replications is needed to achieve a reasonably tight confidence interval. This implies a very long
simulation run. Importance sampling is accomplished by biasing the dynamics of the system so
as to make its typical failures occur more frequently. One possible heuristic is what we call dynamic
importance sampling (DIS) [2, 10]; it is described as follows: as soon as one of the two
components fails, we accelerate the failure of the second component, either by rescheduling it using a
new (accelerated) distribution or by increasing its clock rate. Increasing the failure clock rate is
equivalent to rescheduling with a new distribution obtained by scaling the conditional original
distribution. A reasonable heuristic choice for the new distribution is obtained by appropriately
scaling the original distribution, such that the new failure "rate" is of the same order of magnitude as
the repair "rate" [2]. By rescheduling the failure of the second component, we are also increasing
the probability of a system failure (i.e., both components non-operational). If the second component
fails while the first is in repair, we have a system failure (this is a stopping time for a replication
when estimating the unreliability). If the first component is repaired before the second component
fails, both components become operational and we must reschedule their failures using the original
distributions. This is crucial in order to appropriately move the system only towards a likely path
to failure.


4.  Dependability  Measures 

In this section we discuss the estimation of some measures which are commonly used for the
evaluation of highly dependable systems. These measures can be classified as stationary or transient.
Stationary measures are determined by the long-run (or steady-state) behavior in repairable systems;
they are independent of the initial state. The steady-state availability is a common stationary
measure. Transient measures are determined by the transient behavior in repairable and
non-repairable systems; they depend on the initial state. It is usually assumed that the system starts with
all its components fully operational. The system reliability and the mean time to failure (MTTF)
are common transient measures which we consider in this section. Instantaneous availability,
distribution and expectation of interval availability are examples of other transient measures; the
estimation of these measures in Markovian models is considered in [10].

As a running example, we will consider the machine-repairman model (described in Section 3) to
explain our ideas and to numerically illustrate the effectiveness of importance sampling for the
estimation of dependability measures.

4.1.  System  Reliability 

In this section we consider the estimation of reliability in non-Markovian discrete-event systems
using simulation and importance sampling. No assumptions are made concerning the distributions
of time to failure and time to repair of individual system components. The system is initially in a
state with all its components operational and as good as new. Let $T_F$ be the time at which the
system first enters a failure state. The system reliability $R(t)$ is defined as the probability that the
system does not fail in the interval $(0, t)$, i.e.,

$$R(t) = E(I(T_F > t)), \tag{4.1}$$

where $I(\cdot)$ is the indicator function. The method of replications is the typical simulation method
for estimating the reliability. In each replication we simulate the system until either a failure occurs
or the time interval exceeds $t$. Let $n_r$ be the number of replications. The resulting estimate for the
unreliability $U(t) = 1 - R(t)$ is given by

$$\hat{U}(t) = \frac{1}{n_r} \sum_{i=1}^{n_r} I(T_F^i \le t), \tag{4.2}$$

where $T_F^i$ is the time to failure at the $i$-th replication and $I(T_F^i \le t)$ is the value of the indicator
function. Clearly, if we use standard simulation in a highly reliable system, then the value of the
indicator function is zero in all but a few replications. A very large number of replications (i.e., a
very long simulation) is needed in order to obtain an estimate with a tight confidence interval.
Importance sampling, as described in Section 3, is very effective in improving the efficiency of such
simulations. Let $N_i$ be the stopping time in the $i$-th replication, i.e., the number of internal state
transitions until either a system failure or the first transition to occur after time $t$. The resulting
estimate is unbiased and is given by

$$\hat{U}(t) = \frac{1}{n_r} \sum_{i=1}^{n_r} I(T_F^i \le t)\, L(X_{0,N_i}), \tag{4.3}$$

where $L(X_{0,N_i})$ is the likelihood ratio as given by Equation (3.3).

It is important to mention that forcing [12] can be combined with failure biasing (described in
Section 3) to estimate reliability. This is particularly useful when $t$ is small so that a failure is
unlikely to occur in the interval $(0, t)$. In the machine-repairman model of Section 3, forcing is
accomplished by scheduling component failures using the original distribution conditioned so that a
failure is guaranteed to occur before time $t$. Once a failure occurs, the second component is
rescheduled using an accelerated distribution. In each replication, forcing should be done every time
both components become operational.
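
A minimal sketch of forcing for a single exponential failure clock is given below. The exponential assumption, the rate and the function name are illustrative and not taken from the paper. The failure time is drawn from the original distribution conditioned on falling before the horizon by inverting the conditional distribution function; the accompanying likelihood ratio is then the constant probability that an unforced failure would have occurred before the horizon.

```python
import math
import random


def forced_exponential(rate, horizon, rng):
    """Sample T ~ Exp(rate) conditioned on T <= horizon (inverse-CDF method).

    Returns the sample and the likelihood ratio f(T) / f_forced(T), which is
    the constant P(T <= horizon) because f_forced(T) = f(T) / P(T <= horizon).
    """
    p_within = 1.0 - math.exp(-rate * horizon)
    u = rng.random()
    sample = -math.log(1.0 - u * p_within) / rate
    return sample, p_within


rng = random.Random(0)
draws = [forced_exponential(1e-3, 10.0, rng) for _ in range(1000)]
```

For a general distribution with a nonzero clock age, the same construction would apply to the conditional remaining-lifetime distribution, with the horizon replaced by the time left in the interval.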

In Markovian models, conditioning out the holding time in the initial (fully operational) state [10]
has also proven quite effective in improving the efficiency of simulations to estimate transient
measures. However, extending this technique to systems with general failure time distributions is
difficult.

To illustrate the feasibility and effectiveness of importance sampling (with rescheduling) to estimate
system reliability, we consider the machine-repairman example (of Section 3) with two-stage
hyperexponential failure and repair distributions. A two-stage hyperexponential failure time is
generated from an exponential of parameter $\lambda_1$ with probability $q_f$ and from an exponential of
parameter $\lambda_2$ with probability $1 - q_f$. The parameters of the failure distribution are
$q_f = .9$, $\lambda_1 = .001$ per hour and $\lambda_2 = .01$ per hour. We have selected relatively high failure rates
so that standard simulation could provide us with reasonable estimates for the purpose of
comparison. The parameters of the repair distribution are $q_r = .9$, $\mu_1 = 1$ per hour and $\mu_2 = 10$ per
hour. For importance sampling, we use an accelerated failure distribution which is the same as the
original distribution with its rates scaled up (while other choices are also possible, determining the
optimal choice is an open research problem). The parameters of the accelerated failure distribution are
$q'_f = .9$, $\lambda'_1 = .5$ per hour and $\lambda'_2 = 5$ per hour.

For the interval between 0 and 10 hours, a very accurate estimate of the unreliability $U(10)$ is
obtained numerically using the SAVE package [8]. For the purpose of comparison, we have also
used standard simulation as well as importance sampling, each for a total of 128000 simulated
events. The numerical and simulation estimates are as follows (with the half-width of the 90%
confidence interval as a percentage of the point estimate):

Numerical:            $5.775 \times 10^{-5}$

Standard simulation:  $4.737 \times 10^{-5} \pm 94.98\%$

Importance sampling:  $5.560 \times 10^{-5} \pm 7.89\%$

Notice that by using importance sampling we get more than 10 times improvement in the
confidence interval. Since the half-width shrinks as the inverse square root of the run length, this is
equivalent to more than 100 times reduction in the simulation run-length.
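
The experiment above can be imitated with a short Python sketch. This is not the CSIM-based tool used in the paper, but a minimal reimplementation under stated assumptions: the failure of the surviving component is rescheduled by drawing a fresh value from the accelerated distribution as soon as the other component fails, and when both components are operational their remaining lifetimes are resampled from the original distributions conditioned on age (which leaves the original dynamics unchanged). All names are hypothetical, and the run length is counted in replications rather than simulated events.

```python
import math
import random

FAIL = (0.9, 0.001, 0.01)   # q_f, lambda_1, lambda_2 (per hour)
FAST = (0.9, 0.5, 5.0)      # accelerated failure distribution
REPAIR = (0.9, 1.0, 10.0)   # q_r, mu_1, mu_2 (per hour)


def sf(t, dist):
    q, r1, r2 = dist
    return q * math.exp(-r1 * t) + (1 - q) * math.exp(-r2 * t)


def pdf(t, dist):
    q, r1, r2 = dist
    return q * r1 * math.exp(-r1 * t) + (1 - q) * r2 * math.exp(-r2 * t)


def draw(rng, dist, age=0.0):
    """Remaining lifetime of a hyperexponential clock that has reached `age`."""
    q, r1, r2 = dist
    w1 = q * math.exp(-r1 * age)          # posterior weight of stage 1
    w2 = (1 - q) * math.exp(-r2 * age)    # posterior weight of stage 2
    rate = r1 if rng.random() * (w1 + w2) < w1 else r2
    return rng.expovariate(rate)


def replicate(rng, horizon, biased):
    """Return I(T_F <= horizon) times the likelihood ratio for one replication."""
    now, ratio, ages = 0.0, 1.0, [0.0, 0.0]
    while True:
        # Both components up: failures follow the original distributions.
        left = [draw(rng, FAIL, a) for a in ages]
        first = 0 if left[0] <= left[1] else 1
        other = 1 - first
        now += left[first]
        if now > horizon:
            return 0.0
        age = ages[other] + left[first]          # age of the surviving component
        repair = draw(rng, REPAIR)
        # Reschedule the survivor's failure, accelerated when biasing is on.
        fail = draw(rng, FAST) if biased else draw(rng, FAIL, age)
        if fail < repair:                        # second failure: system down
            if now + fail > horizon:
                return 0.0
            if biased:
                ratio *= pdf(age + fail, FAIL) / sf(age, FAIL) / pdf(fail, FAST)
            return ratio
        now += repair
        if now > horizon:
            return 0.0
        if biased:
            ratio *= sf(age + repair, FAIL) / sf(age, FAIL) / sf(repair, FAST)
        ages[first], ages[other] = 0.0, age + repair


def unreliability(n, horizon=10.0, biased=True, seed=1):
    rng = random.Random(seed)
    return sum(replicate(rng, horizon, biased) for _ in range(n)) / n
```

Calling `unreliability(200000)` gives a biased-run estimate that is expected to fall in the neighbourhood of the numerical value quoted above, although the exact figure depends on the random stream; `biased=False` gives the standard simulation for comparison.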

4.2.  Steady-State  Availability 

The steady-state availability is defined as the long-run fraction of time the system is available. It is
typically used as a metric for evaluating repairable systems. In Markovian models, regenerative
simulations are typically used to estimate the steady-state availability [2] (the state in which all
components of the system are operational is usually chosen as a regeneration point). As a
consequence, a simple estimator for the steady-state unavailability $U_A$ follows from a basic result of
renewal theory [1]:

$$U_A = \frac{E(D)}{E(T)}, \tag{4.4}$$

where $D$ and $T$ are the total "down" time and the total "cycle" time between regenerations,
respectively.
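
As an illustration of how such a ratio is estimated from per-cycle data in a standard (unbiased) regenerative simulation, the following Python sketch computes the point estimate together with a confidence half-width based on the classical regenerative variance estimate. The function name and the sample data are hypothetical.

```python
import math


def ratio_estimate(down_times, cycle_times, z=1.645):
    """Point estimate and confidence half-width for E(D) / E(T).

    Uses the classical regenerative (delta-method) variance estimate;
    z = 1.645 corresponds to a 90% two-sided normal interval.
    """
    n = len(cycle_times)
    d_bar = sum(down_times) / n
    t_bar = sum(cycle_times) / n
    u_hat = d_bar / t_bar
    # Sample variance of the per-cycle residuals D_i - u_hat * T_i
    # (their sample mean is zero by construction).
    s2 = sum((d - u_hat * t) ** 2 for d, t in zip(down_times, cycle_times)) / (n - 1)
    half_width = z * math.sqrt(s2 / n) / t_bar
    return u_hat, half_width


down = [0.0, 0.0, 1.5, 0.0, 0.5]            # hypothetical per-cycle down times (hours)
cycle = [100.0, 250.0, 90.0, 160.0, 400.0]  # hypothetical cycle lengths (hours)
u_hat, half_width = ratio_estimate(down, cycle)
```

Most cycles contribute no down time at all in a highly available system, which is why the half-width stays wide unless very many cycles are simulated.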

Unfortunately, in non-Markovian models with general failure and repair distributions, a
regenerative structure may not be present (for conditions under which a discrete-event system or a GSMP
is regenerative the reader is referred to [ 1! ]). Let us again consider the machine-repairman model
with general failure and repair distributions. We consider two cases in which a regenerative structure
can be recognized. In the first case, we assume that the failure time of individual components is
exponentially distributed. Therefore, a regeneration point is readily identified at repair transitions
after which all components become operational. In the second case, regenerations occur as a result
of a periodic (and deterministic) maintenance on all components. This is true for general failure
and repair distributions, since after maintenance a component is as good as new. In this case, a
regeneration point is identified at the lowest common multiple of all maintenance periods, provided
that no component has failed since its last maintenance. At these points, all components are
operational and the conditional distribution of the time to failure of each individual component is the
same for all regenerations and is conditionally independent of the past.

If a regenerative structure can be recognized in a discrete-event system, then regenerative simulation
can be used to estimate the steady-state unavailability by using Equation (4.4). Let n_c be the
number of regeneration cycles used. Then an estimate of UA is given by

\[ \widehat{UA} = \frac{\bar{D}}{\bar{T}} = \frac{\sum_{i=1}^{n_c} D_i}{\sum_{i=1}^{n_c} T_i}, \]

where D_i and T_i are the total "down" time and the "cycle" time in the i-th regeneration cycle,
respectively. The sample means $\bar{D}$ and $\bar{T}$ are estimates of E(D) and E(T), respectively.
For highly available systems, system failure is a rare event. Therefore, standard simulation is very
inefficient for estimating the numerator E(D), since only a very small fraction of regeneration
cycles will contain failures. Again, importance sampling provides an efficient solution by biasing the
dynamics of the system appropriately, so that a likely path to failure is encountered more often.
Notice that the denominator can be estimated efficiently using standard simulation; in fact, using
importance sampling to estimate the denominator E(T) may increase its variance. Therefore, a better
estimate for the steady-state unavailability can be obtained by using measure specific dynamic
importance sampling (MSDIS), in which importance sampling (as described in Section 3) is used to
estimate the numerator E(D), while standard simulation is used independently to estimate the
denominator E(T). The optimal allocation of the simulation run lengths for estimating the numerator
and the denominator is considered in [ 10 ]. Notice that here, regeneration times are used as stopping
times. Let n_n and n_d be the number of regeneration cycles used to estimate the numerator and
denominator, respectively.
For the numerator, estimates for the mean and the variance are given by
where at the i-th replication, D_i is the value of the numerator and N_i is the stopping time.
L(X_0, N_i) is the associated likelihood ratio as computed from Equation (3.3). Similar equations hold
for the mean and the variance of the denominator, $\bar{T}$ and σ^2(T), respectively, except that here
the likelihood ratio is identically one (since we are using standard simulation). It follows that UA
has the following estimates for its mean and asymptotic variance [ 10 ] (for large n_n and n_d):
with p = n_n/(n_n + n_d).

Let us again consider the machine-repairman example in Section 4.1 to illustrate the effectiveness
of importance sampling to estimate the steady-state unavailability. We change the failure time
distribution to an exponential with parameter λ = .001 per hour; this is done to obtain a regenerative
system for which we can use a regenerative simulation. For importance sampling, we use accelerated
failures from an exponential distribution with parameter λ' = .5 per hour.

An accurate estimate of the unavailability UA is obtained numerically using the SAVE package
[ 8 ]. We give estimates using standard simulation and importance sampling, each for a total of
128000 simulated events. The results are as follows (with the half-width of the 90% confidence
interval given as a percentage of the point estimate):

Numerical: 1.799 x 10^-6

Standard simulation: 1.623 x 10^-6 ± 29.70%

Importance  sampling:  Li.  1 7  x  10-6  ±  2.61% 


Again, we get more than 10 times improvement in the confidence interval by using importance
sampling. In Section 6 we present experimental results for estimating the steady-state
unavailability in a machine-repairman model with periodic maintenance and in a large model of a
computing system. For some experiments, we select typical failure rates in the range of 10^-5 to 10^-6.
In this range, standard simulation produces meaningless results, while the estimates obtained using
importance sampling converge as quickly as those in the above example.
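
The ratio estimate of this section, with the numerator taken from importance-sampling cycles and the denominator from independent standard cycles, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name and inputs are hypothetical, and the variance is the standard first-order (delta-method) approximation for a ratio of independent sample means, which may differ in form from the expression given in [ 10 ].

```python
import math
from statistics import mean, variance


def msdis_unavailability(weighted_down, cycle_times, z=1.645):
    """Ratio estimate of UA with an approximate confidence half-width.

    weighted_down: per-cycle values D_i * L_i from importance-sampling cycles
    cycle_times:   per-cycle values T_i from independent standard cycles
    z:             normal quantile (1.645 gives a two-sided 90% interval)
    """
    n_n, n_d = len(weighted_down), len(cycle_times)
    d_bar, t_bar = mean(weighted_down), mean(cycle_times)
    ua = d_bar / t_bar
    # Numerator and denominator are independent, so the variances of the
    # two sample means add after linearising the ratio around its mean.
    var_ua = (variance(weighted_down) / n_n
              + ua ** 2 * variance(cycle_times) / n_d) / t_bar ** 2
    return ua, z * math.sqrt(var_ua)
```

Calling `msdis_unavailability(d_values, t_values)` returns the point estimate and the half-width of the 90% confidence interval; dividing the second by the first gives the relative half-width quoted in the tables.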

4.3. Mean Time to Failure (MTTF)

MTTF is typically thought of as a transient measure, since it depends on the initial state of the
system. Assuming that the system is initially in a state with all its components operational and as
good as new, the MTTF is defined as the expected time until the system first enters a failure state.
The replication method of simulation is typically used to estimate the MTTF. Again, standard
simulation of highly dependable systems means very long replications and, hence, excessively long
simulation runs. When the replication method of simulation is used, importance sampling may
actually increase the variance of the MTTF estimate; this is because a likely sample path to failure
in the biased system is much shorter (in terms of simulated time) than a likely sample path
in the original system.

If the initial state of the system is a regeneration point, then a ratio representation for the MTTF
is possible [ 17 ],

\[ \mathrm{MTTF} = \frac{E(\tau)}{P(T_F < T)}, \]

where τ = min(T_F, T) is the minimum of the time to system failure (T_F) and the cycle time
(T), and P(T_F < T) is the probability that a system failure occurs before a regeneration. In the
above ratio representation, both the numerator and the denominator can be estimated using
regenerative simulations. The numerator E(τ) can be estimated efficiently using standard simulation.
However, in highly dependable systems, the denominator P(T_F < T) is a very small quantity; hence, it
can be estimated much more efficiently using importance sampling. Here also, MSDIS is recommended
for estimating the MTTF, in which the numerator and the denominator are simulated independently.
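
The ratio representation can be motivated by a short renewal argument. The derivation below is a sketch under the assumption that successive regeneration cycles are independent and identically distributed; the cycle index k and the count K are notation introduced here for illustration.

```latex
\begin{align*}
\text{time to failure} &= \sum_{k=1}^{K} \tau_k,
  \qquad K = \min\{k : T_F^{(k)} < T^{(k)}\},\\
E(K) &= \frac{1}{P(T_F < T)}
  \quad \text{($K$ is geometric)},\\
\mathrm{MTTF} &= E(K)\,E(\tau) = \frac{E(\tau)}{P(T_F < T)}
  \quad \text{(Wald's identity)}.
\end{align*}
```

Each cycle contributes min(T_F, T) to the elapsed time, and the system fails in the first cycle in which a failure precedes the regeneration, which is why only the small probability in the denominator needs importance sampling.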

Unfortunately, a regenerative structure may not be exhibited in a general discrete-event system; this
limits the validity of the ratio representation for the MTTF, and hence the use of importance
sampling, to only those systems in which the initial state is a regeneration point.

Let us again consider the machine-repairman model with general failure and repair distributions.
In Section 4.2 we have recognized two cases in which the system exhibits a regenerative structure.
In particular, if the time to failure of individual components is exponentially distributed, then the
initial state, with all components operational, is a regeneration point. In this case, the ratio
representation of the MTTF is valid and importance sampling can be used to estimate P(T_F < T).

Again, the heuristic for importance sampling is as described in Section 3, except that here, the
stopping time is either the regeneration time or the time to system failure, whichever occurs first.
Let n_d be the number of cycles used, and N_i be the stopping time in the i-th regeneration cycle.
The resulting estimate $\hat{P}_F$ of P(T_F < T) is given by

\[ \hat{P}_F = \frac{1}{n_d} \sum_{i=1}^{n_d} I(T_{F_i} < T_i)\, L(X_0, N_i), \]

where I(T_{F_i} < T_i) and L(X_0, N_i) are the indicator function and the likelihood ratio (from
Equation (3.3)), respectively, evaluated in the i-th regeneration cycle. The estimate $\hat{\tau}$ of
E(τ) is obtained independently using standard simulation. The resulting estimates for the mean and
variance of the MTTF are computed from equations similar to Equations (4.6) and (4.7) for the
steady-state unavailability UA.
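
The estimate of P(T_F < T) can be illustrated on a deliberately small case. The sketch below assumes a hypothetical two-unit system with exponential failure rate `lam` and repair rate `mu`, and biases the embedded jump probabilities after the first failure; it is a toy Markovian example, not the CSIM implementation discussed later, and all names are illustrative.

```python
import random


def estimate_pf(lam, mu, biased_p, n_cycles, seed=0):
    """Importance-sampling estimate of P(T_F < T) for a toy two-unit system.

    After the first failure, the second unit fails before the repair
    completes with probability p = lam / (lam + mu); otherwise the cycle
    regenerates. Under biasing, the failure is taken with probability
    biased_p instead.
    """
    rng = random.Random(seed)
    p = lam / (lam + mu)
    total = 0.0
    for _ in range(n_cycles):
        if rng.random() < biased_p:
            # System failure observed: weight by the likelihood ratio p / p'.
            total += p / biased_p
        # Cycles ending in a regeneration contribute zero (indicator is 0).
    return total / n_cycles
```

With `lam = 0.001`, `mu = 1.0` and `biased_p = 0.5`, about half of the simulated cycles contain a system failure, whereas under the original dynamics only about one cycle in a thousand would.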

Here also we consider the machine-repairman example in Section 4.2 to illustrate the effectiveness
of importance sampling to estimate the MTTF. Notice that the failure time distribution is assumed
to be exponential with parameter λ = .001 per hour. We obtain a regenerative system for which
the ratio representation is valid; thus we can use regenerative simulation and importance sampling.
Again, we use accelerated failures from an exponential distribution with parameter λ' = .5 per hour.

An accurate estimate of the MTTF is obtained numerically using the SAVE package [ 8 ]. In the
following we also give estimates using standard simulation and importance sampling, each for a
total of 128000 simulated events (with the half-width of the 90% confidence interval given as a
percentage of the point estimate):

Numerical: 5.510 x 10^5

Standard simulation: 6.039 x 10^5 ± 22.60%

Importance sampling: 5.450 x 10^5 ± 1.96%

We obtain more than 10 times improvement in the confidence interval by using importance
sampling.

5.  Implementation  Issues 

In this section we consider the implementation of the variance reduction techniques described in
the previous sections. We have implemented these techniques using CSIM [15,16], which is a
process-oriented simulation language based on the C programming language. In a process-oriented
simulation, a model is defined as a collection of interacting processes. Each process is an
independent program which runs in parallel with the other processes, with a main program
synchronizing all of the processes and controlling the interactions between them. For example, in the
reliability system simulations which we consider here, a separate process is created for each
individual component of the system. Each process simulates the failures and repairs of its respective
component. In our models solved with CSIM, we only consider steady-state unavailability, which
we estimate using regenerative simulation.

We define an up cycle to be a segment of the sample path between two successive times when a
component comes out of repair or scheduled maintenance and finds all other components
operational. As we will see later, there may be more than one up cycle in a regenerative cycle.

In models of highly reliable systems, the repair rates of the components are typically orders of
magnitude larger than the failure rates. A method of implementing importance sampling is to
reschedule events in order to bias the system towards the failed state. This is called failure biasing.

When using importance sampling, we want to cause the system to fail using the most likely path
to failure. This suggests using the following strategy for implementing failure biasing. After the first
component failure in an up cycle, we reschedule all of the other components' failure times by
generating new remaining lifetimes using specified biased distributions. The biased distributions are
selected so that the probability that some operating component fails before the component in repair
completes service is in the range of .1 to .5, thus greatly increasing the probability of a system
failure. Until either a system failure occurs or we reach the end of an up cycle, we continue to schedule
all failure lifetimes using the biased failure distributions. Once we reach the end of an up cycle,
we reschedule the remaining lifetimes of all components using the original failure distributions and
repeat the entire process. However, if we reach the failed state during the time failure biasing is
activated, we immediately reschedule all of the remaining lifetimes of the operational components
using the original failure distributions and do not use failure biasing for the rest of the regenerative
cycle. By doing this, we ensure that the probability of two system failures occurring in one
regenerative cycle remains small. For continuous time Markov chains the discrete time conversion of
the above strategy was shown to be an effective technique in [2, 9].

As an alternate approach to rescheduling failures, one can actually alter the rates at which the clocks
associated with the lifetimes of the components advance. In order to implement importance
sampling in a manner similar to rescheduling, we rescale, i.e., divide by a scaling factor r, the remaining
lifetimes of the operational components at precisely the same instants at which clocks were
rescheduled when using the rescheduling technique. The advantage of rescaling clocks is that new
random lifetimes do not have to be generated. In our implementation, we actually altered the repair
clock rate instead of altering all of the failure clock rates. Since we are assuming that there is only
one repairman, this allows us to reschedule only one event, hence saving computational effort. In
order to avoid numerical problems with the likelihood ratio, we pretended that we actually changed
the failure clock rates and did nothing to the repair clock rate. The resulting likelihood ratio is
exactly the same as in the rescheduling case when using scaled conditional distributions for the
biased distributions.
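
For a single exponential failure clock, the rescaling step and its likelihood ratio can be sketched as below. This is an illustration under the assumption of exponential lifetimes only (for general distributions the conditional remaining-lifetime distributions would be needed); the function names are hypothetical and unrelated to CSIM.

```python
import math
import random


def rescale(remaining_lifetime, r):
    """Rescale a clock: divide the remaining lifetime by the factor r."""
    return remaining_lifetime / r


def likelihood_ratio(y, lam, r, failed):
    """Original-to-biased likelihood ratio for one exponential clock.

    y is the time elapsed on the rescaled clock; failed says whether the
    clock expired at y (density ratio) or was still running at y
    (survival ratio).
    """
    survival_ratio = math.exp((r - 1.0) * lam * y)
    return survival_ratio / r if failed else survival_ratio


def prob_fail_before(t, lam, r, n, seed=0):
    """Estimate P(X < t) for X ~ Exp(lam) by sampling rescaled lifetimes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rescale(rng.expovariate(lam), r)
        if y < t:
            total += likelihood_ratio(y, lam, r, failed=True)
    return total / n
```

Dividing an Exp(λ) lifetime by r yields an Exp(rλ) lifetime, so no new random number is drawn, and the weights returned by `likelihood_ratio` keep the estimate unbiased for the original system.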

In the experimental results discussed below, the rescaling technique was used to implement
importance sampling in all of the models except for the maintenance model, in which rescheduling
was used. It should be noted that simulating a fixed number of events using importance sampling
took 10% to 150% more CPU time than standard simulation, depending
on the size of the model solved and the importance sampling implementation used. However, the
extra computation time needed is due to special 'tricks' we had to use in CSIM in order to adjust
the event list of the simulator, and in a different implementation where we are able to directly access
the event list, there would be minimal extra cost. This observation is supported by experiments in
[ 10 ] when using importance sampling for simulating Markovian models.

6.  Examples  and  Discussions 

In this section we use three examples to illustrate the effectiveness of the proposed importance
sampling techniques. First, we use a small example to experiment with some heuristics for selecting
the new probability distributions which make the typically rare system failures occur more often.
Second, these heuristics are applied to a model of a fairly complex computing system to
demonstrate that the methods described in this paper are effective and that orders of magnitude reduction
in variance can be obtained in simulations of large models. We also show that the relative accuracy
of our estimate of unavailability when using our importance sampling technique is independent of
the magnitude of the unavailability. We use exponential failure and repair distributions in this
example to ascertain the correctness of the results obtained by comparing them against numerical
results obtained from the SAVE package [ 7 ]. In the third example, we use Weibull failure
distributions and periodic maintenance for all individual components in the system. We study the
effect of the hazard rate (i.e., increasing, decreasing and constant failure rates) on the optimal
maintenance period. Such studies cannot be performed with existing analytical or numerical
methods.


6.1.  Effects  of  Different  Biased  Failure  Distributions 

In this section, we use a small model to analyze the behavior of the variance when using our
importance sampling technique. In particular, we examine how the amount by which we bias the
system towards failure affects the stability and magnitude of the estimated variance.

The model consists of two types of components, each having a redundancy of two. The failure
distributions of the components are exponential, with the failure rate denoted by λ. There is one
repairman, who services failed components in a FCFS fashion, with repair times being
exponentially distributed with repair rate μ = 1.

We now examine the effect on the amount of variance reduction gained and the stability of the
variance by choosing different scaling factors r. Table 1 contains the values of the variance of the
amount of down time in a regenerative cycle after a specified number of simulated events for various
values of r when λ = 10^-3, and Table 2 contains similar results when λ = 10^-5. If the probability
of system failure is small, then the variance of the down time is the dominating term in the
expression for the variance of steady-state unavailability when estimated as a ratio (see Equation
(4.7)). By choosing the scaling factor r such that μ/10 < r(n-1)λ ≤ μ, where n is the total number
of components in the system, we obtain stable estimates of the variance quickly. Also, for r in this
range, we obtain the largest amount of variance reduction. It is also interesting to point out that
if we choose r too large, the variance actually starts to increase and becomes less stable. The
increase is caused by the added variability in the likelihood ratio. The above experiment was a useful
guide in selecting the scaling factor for larger models.
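
The range for the scaling factor can be turned into a small helper. The sketch below assumes exponential clocks and uses illustrative function names; it also shows that the range corresponds to a biased failure probability between roughly .1 and .5, in line with the heuristic of Section 5.

```python
def scaling_factor_range(n, lam, mu):
    """Return (low, high) such that mu/10 < r*(n-1)*lam <= mu.

    n: total number of components; lam: per-component failure rate;
    mu: repair rate. low is an exclusive bound, high an inclusive one.
    """
    total_rate = (n - 1) * lam
    return mu / (10.0 * total_rate), mu / total_rate


def failure_probability(r, n, lam, mu):
    """P(some operating unit fails before the repair ends), biased clocks."""
    biased = r * (n - 1) * lam
    return biased / (biased + mu)
```

For the small model (n = 4, μ = 1) with λ = 10^-3, this gives r between about 33 and 333; at the upper end the failure probability is .5 and at the lower end it is 1/11.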


6.2.  A  Large  Model 

In this section, we provide empirical results from a large model, showing that the methods described
in this paper are also feasible and effective for larger systems than the ones described above. Also,
we demonstrate that the relative size of the confidence intervals when using importance sampling
is independent of the magnitude of the unavailability of the system, as long as a system failure is
still a rare event. The system we will examine is based on a model of a fairly complex computing
system (also considered in [ 13 ]), with its block diagram shown in Figure 1. The computing
system is composed of two sets of processors with 2 processors per set, two sets of controllers with 2
controllers per set, and 6 clusters of disks, each consisting of 4 disk units. In a disk cluster, data is
replicated so that one disk can fail without affecting the system. The "primary" data on a disk is
replicated such that one third is on each of the other three disks in the same cluster. Thus, one disk
in each cluster can be inaccessible without losing access to the data. The connectivity of the system
is shown in Figure 1. All failure time distributions and repair time distributions are exponential.
We examine the model under two different sets of failure rates in order to show that the relative
width of the confidence interval is insensitive to the magnitude of the unavailability. In the first set,
the failure rates of processors, controllers and disks are assumed to be 1/2000, 1/2000 and 1/6000
per hour, respectively. These rates are much larger than one typically would find in the real world,
but we chose these values so that we could obtain stable estimates of both the unavailability and
its variance using standard simulation in a reasonable amount of time. In the second set, we divide
all of the failure rates by 100, thus creating more realistic failure rates and causing the unavailability
to be even smaller. The repair rate for all components is 1 per hour. Components are repaired
by a single repairman who repairs the components in a FCFS discipline. The system is defined to
be operational if all data is accessible to both processor types, which means that at least one
processor of each type, one controller in each set, and 3 out of 4 disk units in each of the 6 disk
clusters are operational. We also assume that operational components continue to fail at the given
rates when the system is failed.

Since all failure and repair time distributions are exponential, the resulting system is a continuous
time Markov chain. We designed the system in this manner so that we could obtain numerical
(non-simulation) results for the unavailability using the SAVE package [ 7, 8 ]. Since the system
has a few hundred thousand states, only bounds could be computed [13]. These bounds are very
tight and typically do not differ from the exact results significantly.

In Table 3, we have the estimates of unavailability and their 90% confidence intervals for the
different sets of failure rates when using standard simulation and importance sampling after 1,024,000
simulated events. When using importance sampling, the scaling factor r was selected in a manner
analogous to the results from the small model example given in Section 6.1. The first row of the
table contains the results from using the first set of failure rates. The width of the confidence
interval is reduced by a factor of 3.6 by using importance sampling over standard simulation, which
translates into a 13-fold improvement in run length. The results from using the second set of failure
rates are given in the second row of the table. The results from standard simulation are meaningless
because the variance had not yet stabilized by the end of the simulation. However, the results from
using importance sampling are quite accurate, with the size of the relative 90% confidence interval
being the same as that with the first set of failure rates when using importance sampling. Thus,
the relative accuracy of our importance sampling technique is largely independent of the magnitude
of the unavailability, and as the occurrence of a system failure becomes rarer, the amount of
improvement gained by using our importance sampling technique over standard simulation increases,
which is a favorable conclusion.

6.3.  A  Study  of  Effects  of  Maintenance  Policies 

We now demonstrate the types of studies that can be made with the aid of the importance sampling
schemes described in this paper. We examine a non-Markovian model with scheduled periodic
(deterministic) maintenances and determine the effect of varying the length of time between
maintenances when component lifetime distributions have increasing failure rate (IFR), constant failure
rate, and decreasing failure rate (DFR). Because of the complexity of the model, analytic results
are extremely difficult to obtain. Also, as we will see, since system failures occur very rarely,
standard simulation is very inefficient, and importance sampling is the only practical alternative.

We consider a simple maintenance model consisting of one type of component with a redundancy
of two. The distribution of the lifetime of each component is Weibull, with shape parameter α and
scale parameter β. Recall that if α = 1, the component lifetime distributions are exponential. Also,
if α > 1, the distribution has an increasing failure (hazard) rate, and we have a decreasing failure rate
distribution if α < 1. In our experiments, we fixed β = 10^-5 and varied α. There is one repairman
who fixes failed components in FCFS fashion. The length of the repair times is the sum of a
constant c plus an exponentially distributed random quantity with rate μ. The constant c
corresponds to the travel time of the repairman. In all of our simulations, c was 2.0 hours and the repair
rate μ was 0.5 per hour. In addition, each component has a periodic scheduled maintenance every
d hours, where d is deterministic. One component has its first scheduled maintenance at the
beginning of the simulation, and the other component has its first scheduled maintenance after d/2
simulated hours have passed. Thus, the maintenance cycles of the two components are staggered.
All scheduled maintenances take 0.5 hours. Also, after a component comes out of repair from a
failure, the next scheduled maintenance is skipped, and a maintenance is performed on a component
only if the other component is operational. A component is considered to be as good as new
immediately after completing a scheduled maintenance. There is a single repairman, different from the
one who repairs failed components, who performs scheduled maintenances. The system is
considered operational if at least one component is operational, i.e., not failed or in scheduled
maintenance.
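
The role of the shape parameter can be checked numerically with a minimal sketch of the Weibull hazard rate. The parameterisation used here, with survival function exp(-(βt)^α), is an assumption, since the text above does not spell out how β enters; the direction of the trend in t depends only on α.

```python
def weibull_hazard(t, alpha, beta):
    """Hazard rate of a Weibull lifetime with survival exp(-(beta*t)**alpha).

    This parameterisation (beta acting as a rate) is an assumption; whether
    the hazard rises or falls with t depends only on the shape alpha.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    return alpha * beta * (beta * t) ** (alpha - 1.0)
```

With α = 1 the hazard is constant, with α = 1.25 or 1.5 it grows with the age t, and with α = 0.75 it falls, which is why a maintenance that resets the age helps only in the IFR cases.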

Figure 2 shows a plot of the unavailability versus the time between maintenances (d) for the
different values of α. The graph was constructed by running simulations using the different parameter
values, plotting the point estimates, and using linear interpolation between the points. We ran all
the experiments long enough so that the relative half-width of the 90% confidence interval was less
than 10%. It is interesting to note how smooth the curves are for each value of α, thus
demonstrating the effectiveness of our importance sampling technique. Also note that as d → ∞,
the system becomes equivalent to one without scheduled, periodic maintenances. This is
demonstrated by observing that the curve for α = 1.0 is beginning to flatten out for d > 1000.

The  curves  show  that  when  component  lifetimes  have  exponential  or  DFR  distributions,  perform¬ 
ing  scheduled  maintenances  actually  increases  the  unavailability  of  the  system.  When  a  =  1.0,  the 
component  lifetime  distributions  have  constant  failure  rate,  which  means  that  the  conditional  dis- 
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tribution  of  a  failure  given  that  il  is  greater  than  t  does  not  depend  on  i.  I  hus.  a  component  's  reli¬ 
ability  docs  not  improve  by  performing  a  maintenance  on  it  Actually,  performing  scheduled 
maintenances increases the system's unavailability, which can be explained as follows. Since a
scheduled maintenance for a component takes a deterministic amount of time, the conditional
probability of the other component failing during the maintenance time, given that it has already
lived s units of time, is the same for all values of s. Thus, by decreasing the time between
maintenances, we are increasing the frequency with which the system can fail by having a maintenance
and then a failure occurring. This, in turn, leads to higher unavailability. We see similar results
when α = 0.75. However, since the component lifetimes now have DFR distributions, the effect is
more pronounced. This is because the conditional probability that a component fails in a given
interval, given that it has already lived t units of time, is a decreasing function of t. Hence, by
decreasing the time between scheduled maintenances, we not only increase the frequency with which
the system can fail by having one component in maintenance and the other failing during the
maintenance, but also increase the conditional probability of the operational component failing
during the maintenance of the other component. Thus, one should not perform scheduled maintenances
on systems of components with DFR distributions. When α = 1.25, the components have IFR lifetime
distributions. In this case, the unavailability is large for small values of d, attains its minimum
around d = 500, and then increases. When α = 1.5, the unavailability behaves in a similar manner,
with its minimum being attained around d = 100. Hence, in a maintainable system composed of
components having IFR lifetime distributions, scheduled maintenances should be performed more
frequently at higher component failure rates.
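
The argument above rests on how the probability of failing during a fixed maintenance window depends on a component's age. The following is a minimal sketch of that dependence, assuming for illustration Weibull lifetimes with survival function exp(-(rate * t) ** shape). The parameterization, the function name cond_fail_prob, and the numeric values are assumptions made here and are not taken from the model used in the experiments.

```python
import math


def cond_fail_prob(age, window, shape, rate=1.0):
    """P(fail in (age, age + window] | survived to age).

    Assumes a Weibull lifetime with survival function
    S(t) = exp(-(rate * t) ** shape); this form is an illustrative assumption.
    """
    cum_hazard_start = (rate * age) ** shape
    cum_hazard_end = (rate * (age + window)) ** shape
    # 1 - S(age + window) / S(age), computed stably for small probabilities.
    return -math.expm1(-(cum_hazard_end - cum_hazard_start))


if __name__ == "__main__":
    ages = [0.5, 1.0, 2.0, 4.0]   # hypothetical component ages
    window = 0.1                  # hypothetical maintenance duration
    for shape in (0.75, 1.0, 1.25):
        probs = [cond_fail_prob(a, window, shape) for a in ages]
        print(f"shape={shape}: " + ", ".join(f"{p:.4f}" for p in probs))
```

With shape 1 (exponential lifetimes) the probability does not depend on age; with shape below 1 it falls with age, so resetting the component's age by maintenance raises it; with shape above 1 it grows with age, which is the case in which maintenance can help.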


7.  Summary 

In this paper we have described an approach for simulating models of highly dependable systems
with general failure and repair time distributions. The approach combines importance sampling
with event rescheduling in order to obtain variance reduction in such rare event simulations. The
approach is general in nature and allows us to effectively simulate a variety of features commonly
arising in dependability modeling. For example, in this paper we have shown how the technique
can be applied to systems with periodic maintenance. We have explored how the steady-state
availability is affected by the maintenance period and by different failure time distributions.


We described some of the trade-offs involved in the design of specific rescheduling rules, and
demonstrated their potential effectiveness in simulations of systems with both exponential and
non-exponential failure and repair time distributions. We found that an effective method for
selecting the rescheduling distribution is to make the probability of a failure transition lie in
the range from 0.1 to 0.5. In addition, we used a rescaling of clock values as an inexpensive way
to implement rescheduling. While this can be effective when the clocks have nearly constant hazard
rates, different rescheduling algorithms may be required when the clock densities are more general.
The use of importance sampling for estimating steady-state availability and MTTF requires that the
underlying model of the system has a regenerative structure. This requires either exponential
failure distributions or general failure distributions with periodic (deterministic) maintenance.
On the other hand, the use of importance sampling for estimating transient measures, such as
reliability, is completely general and does not require any assumption on the failure and the
repair processes.
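
As a rough illustration of clock rescaling with a likelihood-ratio correction, the sketch below estimates the probability that a single component fails within a fixed window. It is a toy single-clock example, not the rescheduling algorithm used in the experiments. The Weibull parameterization (survival function exp(-(rate * t) ** shape)), the function names, and all numeric values are assumptions introduced here. The scale factor c is chosen so that the failure probability under the rescaled clock equals a target inside the 0.1 to 0.5 range.

```python
import math
import random


def weibull_sample(rng, rate, shape):
    """Inverse-transform sample with survival function exp(-(rate * t) ** shape)."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return (-math.log(u)) ** (1.0 / shape) / rate


def estimate_failure_prob(rate, shape, window, target, n, seed=0):
    """Importance-sampling estimate of P(X < window) by rescaling the clock.

    Returns (estimate, standard error, exact value). Hypothetical helper.
    """
    rng = random.Random(seed)
    exact = -math.expm1(-((rate * window) ** shape))
    # Choose c so that P(X / c < window) = target under the rescaled clock.
    c = (-math.log1p(-target)) ** (1.0 / shape) / (rate * window)
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        x = weibull_sample(rng, rate, shape) / c  # rescaled clock value
        if x < window:
            # Likelihood ratio f(x) / g(x), where g is the density of X / c.
            lr = c ** (-shape) * math.exp((c ** shape - 1.0) * (rate * x) ** shape)
            total += lr
            total_sq += lr * lr
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(variance / n), exact


if __name__ == "__main__":
    est, se, exact = estimate_failure_prob(
        rate=1e-4, shape=1.25, window=1.0, target=0.3, n=20000
    )
    print(f"estimate={est:.4e}  std.err={se:.2e}  exact={exact:.4e}")
```

With these assumed parameters the exact probability is about 1e-5, so standard simulation would observe on average one failure per 100,000 samples, whereas under the rescaled clock roughly 30% of the samples hit the event and each is weighted by its likelihood ratio.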

We are currently in the process of implementing importance sampling for estimating reliability,
MTTF and interval availability in large models (here, we have only experimented with the
estimation of these measures in small models). We are also working on the problem of estimating the
gradient of dependability measures in non-Markovian models using importance sampling.

Table 1: Estimated variance of down time in a cycle for 9-state model (λ = 10^[illegible])
using different scaling factors r

2.326 x 10^[illegible]


Table 2: Estimated variance of down time in a cycle for 9-state model (λ = 10^[illegible])
using different scaling factors r

Surviving column headings: Importance Sampling; Scaling Factor r

4.055 x 10^[illegible]    4.121 x 10^[illegible] ±13.6%     4.124 x 10^[illegible] ±3.8%    10^[illegible]
4.000 x 10^[illegible]    6.165 x 10^[illegible] ±159.5%    4.027 x [illegible] ±3.7%       10^[illegible]

Table  3:  Estimates  of  unavailability  and  90%  confidence  intervals  for  a  large 
model  (1,021,000  events) 



UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE

MASTER COPY: FOR REPRODUCTION PURPOSES

REPORT DOCUMENTATION PAGE

1a. REPORT SECURITY CLASSIFICATION: Unclassified
3. DISTRIBUTION / AVAILABILITY OF REPORT: Approved for public release; distribution unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): Technical Report No. 53
6a. NAME OF PERFORMING ORGANIZATION: Dept. of Operations Research
6c. ADDRESS (City, State, and ZIP Code): Stanford, CA 94305-4022
7a. NAME OF MONITORING ORGANIZATION: U. S. Army Research Office
7b. ADDRESS (City, State, and ZIP Code): P. O. Box 12211, Research Triangle Park, NC 27709-2211
8a. NAME OF FUNDING / SPONSORING ORGANIZATION: U. S. Army Research Office
8c. ADDRESS (City, State, and ZIP Code): P. O. Box 12211, Research Triangle Park, NC 27709-2211
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: - 00b3
11. TITLE (Include Security Classification): Fast simulation of dependability models with general failure, repair and maintenance processes
12. PERSONAL AUTHOR(S): Victor F. Nicola, Marvin K. Nakayama, Philip Heidelberger, Ambuj Goyal
13a. TYPE OF REPORT: Technical
14. DATE OF REPORT (Year, Month, Day): January 1990
15. PAGE COUNT: 24
16. SUPPLEMENTARY NOTATION: The view, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other documentation.
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number): fast simulation, dependability models, importance sampling, non-Markovian models, periodic maintenance
19. ABSTRACT (Continue on reverse if necessary and identify by block number): (please see next page)
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified

DD FORM 1473, 84 MAR. 83 APR edition may be used until exhausted. All other editions are obsolete.


FAST SIMULATION OF DEPENDABILITY MODELS
WITH GENERAL FAILURE, REPAIR
AND MAINTENANCE PROCESSES

Victor F. Nicola *, Marvin K. Nakayama **,
Philip Heidelberger * and Ambuj Goyal *

* IBM Research Division
T.J. Watson Research Center
P.O. Box 704
Yorktown Heights, New York 10598

** Department of Operations Research
Stanford University
Stanford, California 94305

ABSTRACT 

The problem of computing dependability measures of repairable systems with general failure, repair
and maintenance processes is a hard problem to solve in general either by analytical or by numerical
methods. Monte Carlo simulation could be used to solve this problem; however, standard simulation
takes a very long time to estimate system reliability and availability with reasonable accuracy
because typically the system failure is a rare event. When the failure and repair time distributions
are exponential, importance sampling has been used successfully in the past to reduce simulation run
lengths. In this paper, we extend the applicability of importance sampling to non-Markovian
models with general failure and repair time distributions. We show that by carefully selecting a
heuristic for importance sampling, orders of magnitude reduction in simulation run-lengths can be
obtained. We illustrate the effectiveness of the technique by modelling a large repairable computing
system. Also, we study the effect of periodic maintenance on systems with components having
increasing and decreasing failure rate.




